Differentiable pooling for unsupervised speaker adaptation
This paper proposes a differentiable pooling mechanism to perform model-based neural network speaker adaptation. The proposed technique learns a speaker-dependent combination of activations within pools of hidden units, works well in an unsupervised setting, and does not require speaker-adaptive training. We conducted a set of experiments on the TED talks data, as used in the IWSLT evaluations. Our results indicate that the approach can reduce word error rates (WERs) on standard IWSLT test sets by about 5–11% relative compared to speaker-independent systems, and that it is complementary to the recently proposed learning hidden unit contributions (LHUC) approach, reducing WER by 6–13% relative. Both methods were also found to work well when adapting with small amounts of unsupervised data: 10 seconds is enough to decrease the WER by 5% relative compared to the baseline speaker-independent system.
Neural networks for distant speech recognition
Distant conversational speech recognition is challenging owing to the presence of multiple, overlapping talkers, additional non-speech acoustic sources, and the effects of reverberation. In this paper we review work on distant speech recognition, with an emphasis on approaches which combine multichannel signal processing with acoustic modelling, and investigate the use of hybrid neural network / hidden Markov model acoustic models for distant speech recognition of meetings recorded using microphone arrays. In particular we investigate the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout). We performed experiments on the AMI and ICSI meeting corpora, with results indicating that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models. Index Terms: convolutional neural networks, distant speech recognition, rectifier unit, maxout networks, beamforming, meetings, AMI corpus, ICSI corpus
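The three activation functions named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the `pool_size` parameter are assumptions introduced here, and maxout is shown in its usual form as a maximum over non-overlapping groups of linear pre-activations.

```python
import math


def sigmoid(x):
    """Logistic sigmoid of a scalar pre-activation."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def maxout(pre_activations, pool_size):
    """Maxout: maximum over non-overlapping groups of `pool_size` values.

    `pool_size` is a hypothetical parameter name; the layer width must be
    a multiple of it.
    """
    if len(pre_activations) % pool_size != 0:
        raise ValueError("layer width must be a multiple of pool_size")
    return [
        max(pre_activations[i:i + pool_size])
        for i in range(0, len(pre_activations), pool_size)
    ]
```

Unlike the sigmoid and the rectifier, which act on each unit separately, maxout reduces the layer width by a factor of `pool_size`.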
Differentiable Pooling for Unsupervised Acoustic Model Adaptation
We present a deep neural network (DNN) acoustic model that includes
parametrised and differentiable pooling operators. Unsupervised acoustic model
adaptation is cast as the problem of updating the decision boundaries
implemented by each pooling operator. In particular, we experiment with two
types of pooling parametrisations: learned $L_p$-norm pooling and weighted
Gaussian pooling, in which the weights of both operators are treated as
speaker-dependent. We perform investigations using three different large
vocabulary speech recognition corpora: AMI meetings, TED talks and Switchboard
conversational telephone speech. We demonstrate that differentiable pooling
operators provide a robust and relatively low-dimensional way to adapt acoustic
models, with relative word error rate reductions ranging from 5 to 20% with
respect to unadapted systems, which themselves are better than the baseline
fully-connected DNN-based acoustic models. We also investigate how the proposed
techniques work under various adaptation conditions including the quality of
adaptation data and complementarity to other feature- and model-space
adaptation methods, as well as providing an analysis of the characteristics of
each of the proposed approaches.
Comment: 11 pages, 7 Tables, 7 Figures in IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 24, num. 11, 201
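As a rough illustration of the two pooling parametrisations mentioned in this abstract, the sketch below pools one group of hidden activations in two ways. It is a sketch under stated assumptions, not the paper's definition: the 1/K normalisation in the norm pooling, the normalised Gaussian weights, and the parameter names `p`, `mean` and `precision` are all choices made here for illustration. In an adaptation setting, these few parameters per pool would be the speaker-dependent quantities that get updated.

```python
import math


def lp_norm_pool(activations, p):
    """Pool a group of activations with an L_p norm of order `p`.

    Assumed form: ((1/K) * sum |a_i|^p)^(1/p). With p = 1 this is the mean
    absolute activation; as p grows it approaches the maximum.
    """
    k = len(activations)
    return (sum(abs(a) ** p for a in activations) / k) ** (1.0 / p)


def gaussian_weighted_pool(activations, mean, precision):
    """Pool a group with weights from a Gaussian kernel over activations.

    Assumed form: weights exp(-precision/2 * (a_i - mean)^2), normalised
    to sum to one. A very large `precision` can underflow the weights to
    zero; this sketch does not guard against that.
    """
    weights = [math.exp(-0.5 * precision * (a - mean) ** 2) for a in activations]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, activations)) / total
```

Both operators are differentiable in their parameters, which is what would allow them to be updated by gradient descent on adaptation data while the rest of the network stays fixed.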
Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition
This paper presents an extension to train end-to-end Context-Aware
Transformer Transducer (CATT) models by using a simple, yet efficient method
of mining hard negative phrases from the latent space of the context encoder.
During training, given a reference query, we mine a number of similar phrases
using approximate nearest neighbour search. These sampled phrases are then used
as negative examples in the context list alongside random and ground truth
contextual information. By including approximate nearest neighbour phrases
(ANN-P) in the context list, we encourage the learned representation to
disambiguate between similar, but not identical, biasing phrases. This improves
biasing accuracy when there are several similar phrases in the biasing
inventory. We carry out experiments in a large-scale data regime obtaining up
to 7% relative word error rate reductions for the contextual portion of test
data. We also extend and evaluate the CATT approach in streaming applications.
Comment: 5 pages, 2 figures, 2 table
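The context-list construction described in this abstract can be sketched as follows. This is an illustration only: it uses exact brute-force cosine similarity as a stand-in for approximate nearest neighbour search, and the function names, the `embeddings` dictionary of phrase vectors, and the `n_hard` and `n_random` parameters are hypothetical, not taken from the paper. Embedding vectors are assumed to be non-zero.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(query, embeddings, n_hard):
    """Return the `n_hard` phrases closest to `query`, excluding the query.

    Exact search is used here; a real system would swap in an ANN index.
    """
    candidates = [p for p in embeddings if p != query]
    candidates.sort(
        key=lambda p: cosine(embeddings[query], embeddings[p]), reverse=True
    )
    return candidates[:n_hard]


def build_context_list(query, embeddings, n_hard, n_random, rng):
    """Combine the ground-truth phrase, hard negatives and random negatives."""
    hard = mine_hard_negatives(query, embeddings, n_hard)
    remaining = [p for p in embeddings if p != query and p not in hard]
    rand = rng.sample(remaining, min(n_random, len(remaining)))
    context = [query] + hard + rand
    rng.shuffle(context)  # avoid a fixed position for the true phrase
    return context


# Toy embeddings: "jon smith" lies close to "john smith" in the latent space.
toy_embeddings = {
    "john smith": [1.0, 0.0],
    "jon smith": [0.9, 0.1],
    "jane doe": [0.0, 1.0],
    "play music": [-1.0, 0.2],
}
toy_context = build_context_list("john smith", toy_embeddings, 1, 1, random.Random(0))
```

In the toy example, the context list always contains the reference phrase and its near-duplicate "jon smith", plus one randomly drawn phrase, so the model has to tell similar biasing phrases apart.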